Using Statistical Features to Find Phrasal Terms in Text Collections
Abstract
In this work we investigate alternatives for automatically detecting phrasal terms, defined here as phrasal verbs, phrasal nouns, phrasal adjectives, or phrasal adverbs found in a text. The automatic identification of phrasal terms has several potential applications in text processing systems. We present a novel approach for detecting phrasal terms in a collection of documents. Our solution is based on machine learning and uses statistical features of the word n-grams found in the documents. We also investigate the impact of adding phrasal terms to the retrieval model of a search engine when processing queries on several data sets. Our results show that we are able to discover valid phrasal terms with a small error rate, achieving F1 scores ranging from 70% to 94%. Furthermore, the discovered phrasal terms, when used to enhance search tasks, improve retrieval performance by up to 11% in MAP over all queries, and by up to 36% in MAP over only the queries that contain the detected phrasal terms.
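The abstract does not say which statistical features are used, so the following is only an illustrative sketch of what "statistical features of word n-grams" could look like for bigrams. The choice of raw frequency and pointwise mutual information (PMI), the function name `bigram_features`, and the toy sentence are all assumptions made for illustration, not the authors' feature set.

```python
import math
from collections import Counter


def bigram_features(tokens):
    """Return {bigram: (frequency, pmi)} for adjacent word pairs in tokens.

    Hypothetical helper: frequency and PMI are assumed example features,
    not the features used in the paper.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    features = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the bigram
        p_x = unigrams[w1] / n_uni   # relative frequency of the first word
        p_y = unigrams[w2] / n_uni   # relative frequency of the second word
        features[(w1, w2)] = (count, math.log2(p_xy / (p_x * p_y)))
    return features


# Toy input, invented for this example.
tokens = "please look up the word then look up the next word".split()
features = bigram_features(tokens)
```

Feature vectors of this kind could then be passed to any supervised classifier that labels each candidate n-gram as a phrasal term or not; the abstract does not specify which learner is used.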
Similar resources
Phrasal: A Toolkit for New Directions in Statistical Machine Translation
We present a new version of Phrasal, an open-source toolkit for statistical phrase-based machine translation. This revision includes features that support emerging research trends such as (a) tuning with large feature sets, (b) tuning on large datasets like the bitext, and (c) web-based interactive machine translation. A direct comparison with Moses shows favorable results in terms of decoding s...
The Effect of Conceptual Metaphor Awareness on Learning Phrasal Verbs by Iranian Intermediate EFL Learners
The ability to comprehend and produce phrasal verbs, as lexical chunks or groups of words which are commonly found together, is an important part of language learning. This study investigates the effect of ‘conceptual metaphor awareness’, as a newly developed technique in Cognitive Linguistics, on learning phrasal verbs by Iranian intermediate EFL learners. To meet this objective, two intact ho...
Using the Text Corpus to Create a Comprehensive List of Phrasal Verbs
The paper describes the extraction of Estonian multi-word verbs from text corpora, using a language- and task-specific software tool SENVA, which is based on a statistical language-independent software tool SENTA (Dias et al., 2000). The outcome is a comprehensive list of 16,000 phrasal verbs. We describe the extraction tool, manual post-editing principles, and evaluate the outcome in terms of precisi...
Identifying Phrasal Verbs Using Many Bilingual Corpora
We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehen...
Using Phrasal Verbs as an Index to Distinguish Text Genres
Previous studies have shown that text genres can be computationally distinguished by sophisticated computational and statistical methods. The current study adds to the previous body of work by incorporating phrasal verbs as a text genre identifier. Results indicate that phrasal verbs significantly distinguish between both the spoken/written and formal/informal dimensions, with considerably less...
Journal title:
- JIDM
Volume 1, issue
Pages -
Publication date 2010